Abstract

Background: Population differentiation is the result of demographic and evolutionary forces. Whole genome datasets 
from the 1000 Genomes Project (October 2012) provide an unbiased view of genetic variation across populations from 
Europe, Asia, Africa and the Americas. Common population-specific SNPs (MAF > 0.05) reflect a deep history and may 
have important consequences for health and wellbeing. Their interpretation is contextualised by currently available 
genome data. 

Results: The identification of common population-specific (CPS) variants (SNPs and SSVs) is influenced by admixture 
and the sample size under investigation. Analysis of nine of the populations in the 1000 Genomes Project (2 African, 2 Asian 
(including a merged Chinese group) and 5 European) revealed that the African populations (LWK and YRI), 
followed by the Japanese (JPT), have the highest number of CPS SNPs, in concordance with their histories and 
given the populations studied. Using two methods, sliding 50-SNP and 5-kb windows, the CPS SNPs showed distinct 
clustering across large genome segments and little overlap of clusters between populations. iHS enrichment score and 
the population branch statistic (PBS) analyses suggest that selective sweeps are unlikely to account for the clustering 
and population specificity. Of interest is the association of clusters close to recombination hotspots. Functional analysis 
of genes associated with the CPS SNPs revealed over-representation of genes in pathways associated with neuronal 
development, including axonal guidance signalling and CREB signalling in neurones. 

Conclusions: Common population-specific SNPs are non-randomly distributed throughout the genome and are 
significantly associated with recombination hotspots. Since the variant alleles of most CPS SNPs are the derived 
allele, they likely arose in the specific population after a split from a common ancestor. Their proximity to genes 
involved in specific pathways, including neuronal development, suggests evolutionary plasticity of selected 
genomic regions. Contrary to expectation, selective sweeps did not play a large role in the persistence of 
population-specific variation. This suggests a stochastic process towards population-specific variation which 
reflects demographic histories and may have some interesting implications for health and susceptibility 
to disease. 
Background

The global diversity of human genomes is the outcome of 
a series of demographic and evolutionary events including 
migration, bottlenecks, admixture, population isolation, 
natural selection and genetic drift which occurred in dif- 
ferent parts of the world at various time points in history 
[1-3]. Genomic signatures of many of these events have 
been preserved in the genomes of different populations 
and play a pivotal role in uncovering demographic histor- 
ies in addition to understanding health and disease [4,5]. 
In the last decade, two major consortium-based efforts, 
the HapMap project and the Human Genome Diversity 
Project (HGDP), as well as several other studies based on 
genotyping of single nucleotide changes, have attempted 
to catalogue the genetic variations that exist between 
individuals of a population as well as among different 
populations across continents [6-11]. 

Data from these studies on genetic diversity have been 
instrumental in estimating the origin and history of 
different contemporary populations as well as shedding 
light on the evolutionary relationship between them 
[12]. Moreover, the genotype data from these studies have 
been subjected to various computational techniques to 
derive estimates of population sizes and divergence times 
for the major demographic events in human history, 
which in many cases have been found to be in agreement 
with evidence from existing historical accounts and 
archaeological records [13,14]. However, these studies were 
based on a fixed number of single nucleotide polymorphisms 
(SNPs) with a clear ascertainment bias (the 
SNPs included in the genotyping platforms were selected 
on the basis of their occurrence and frequencies primarily 
in European populations). It was therefore difficult to 
reliably assess the nature and extent of the genomic diversity 
that exists among different populations from these 
studies [15]. 

The next major wave of information about genetic and 
genomic diversity in human populations came from 
studies based on exome and whole genome sequencing 
[16-19]. The 1000 Genomes Project, for example, in 
addition to identifying millions of novel SNPs and more 
than a million short structural variants (SSVs), showed 
that rare variants account for a large majority of the exist- 
ing genetic diversity between individuals as well as within 
populations [17,18]. Moreover, it was suggested that there 
is an excess of rare and deleterious mutations in human 
genomes, probably resulting from exponential population 
growth and weak purifying selection [17,18]. Studies based 
on deep sequencing of selected regions from thousands of 
individuals further show that the majority of rare coding 
variants, with allele frequencies lower than 0.0005, are 
also population-specific and potentially deleterious [19]. 
In addition to thousands of contemporary human genomes, 
many archaic genomes have also 
been sequenced recently, which has provided evidence 
for archaic admixture in non-African genomes [20-22]. 
Such admixture might also be present in at least some 
of the African populations [23,24]. These studies taken 
together have not only resulted in a paradigm shift in 
our understanding of various aspects of human genomic 
diversity but also provided necessary data for addressing 
numerous other questions related to human genome 
evolution. 

SNPs and structural variants are broadly classified into 
common and rare based on minor allele frequencies 
(MAF), with a MAF of less than 0.05 being a widely used 
cut-off for defining rare SNPs [17]. However, this cut-off 
is pragmatic in nature and does not have any special 
biological relevance. Although differences in SNP allele 
frequencies might be influenced by various demographic 
factors like selection and population size, time is the major 
determinant in the rise or fall of allele frequencies. 
Mathematical estimates suggest most of the common 
SNPs to have originated thousands of years ago and 
therefore to have a wider geographic distribution in 
contrast to rare variants which are mostly more recent 
and geographically restricted [25]. The rare and common 
variants therefore allow us to investigate events at dif- 
ferent time scales of demographic histories. The rela- 
tive phenotypic importance of common and rare SNPs 
is highly debated [26]. Nevertheless, while most of the 
variants underlying Mendelian traits and most deleterious 
mutations have been shown to be rare, several studies 
suggest that some continuous traits like height might well 
be explained in terms of common SNPs [27,28]. 

SNPs and structural variants are often classified into 
'private' and 'shared' based on their distribution in a single 
population or a range of populations. The term private 
however might imply different things based on the con- 
text, for example, a SNP might be private to an individual 
or a family, or to a population (monomorphic in all but 
one population; also referred to as 'population private') or 
to an ancestral group. Therefore, we will use the term 
'population-specific' for the SNPs that have been found to 
occur only in a single population. Although private SNPs 
have not been shown to be involved in major phenotypic 
traits or common diseases, population-specific SNPs might 
well be important in ascribing characteristic phenotypes and 
disease susceptibility/protection to a population [29,30]. 

Population specificity of genetic variants, if the popula- 
tion-specific allele is the derived allele, might originate 
from two different scenarios: in the first scenario, a variant 
allele originates in a single population and remains re- 
stricted to the population of its origin. The second sce- 
nario is that the variant originated before differentiation of 
populations, survives in only a single population, and gets 
eliminated from other populations. In cases where the 
population-specific allele is the ancestral allele, both the 
alleles are estimated to have evolved far back in evolutionary 
history and the derived allele replaces the ancestral allele 
in all but one of the populations, probably through select- 
ive sweeps. Alternatively, in some cases, the assignment of 
ancestral state may be incorrect. The other possible sce- 
nario by which population-specific SNPs might originate 
is by admixture with populations which are not included 
in the study or even populations which are no longer 
extant. Therefore, in addition to the functional role of 
these SNPs, the population-specific SNPs might also 
play an important role in characterizing ancestry and 
understanding demographic histories [31,32]. For example, 
on a genome wide scale the number of population-specific 
SNPs in a population would be expected to be related 
to the age of the population and also to reflect demo- 
graphic events like bottlenecks, geographical isolation 
and admixtures. 

Despite their potential significance, population-specific 
SNPs have not been studied extensively. Previous HapMap 
data based studies on population-specific SNPs have been 
able to identify only a small number of population-specific 
SNPs due to ascertainment bias of the genotyping plat- 
form [6,33-35]. The availability of unbiased whole genome 
sequence data from sources like the 1000 Genomes 
project, however, has now made the identification and 
characterization of population-specific SNPs on a genome 
wide scale possible. Moreover, sequencing-based studies 
have shown population-specific SNPs to be one of the 
major components of genetic diversity within populations 
[17-19,36]. A deeper understanding of population-specific 
variations, their genomic distribution and potential func- 
tional relevance is important. 

We have used 1000 Genomes sequence data (release 
October 2012), including more than one thousand indi- 
viduals from 14 populations spanning Europe, Asia, Africa 
and America, to identify SNPs and structural variants that 
are private or specific to each population and to study 
their genomic distribution and potential functional rele- 
vance [17,18]. However, as the population sample sizes are 
relatively small (<100) and the sequencing is low coverage 
(4X-6X) for most of the 1000 Genomes data, low fre- 
quency alleles are harder to accurately identify and may be 
incorrectly identified as population-specific [17,18]. We 
have therefore focused our study on common population- 
specific (CPS) SNPs as higher MAF population-specific 
SNPs are expected to be more informative and less likely 
to be incorrectly annotated as population-specific in 
this dataset. We evaluated the frequency distribution 
of population-specific SNPs identified in our study in 
the context of the generally accepted model of population 
migration and differentiation. We analysed the genomic 
distribution of these SNPs using fixed length and fixed 
bin window scan based approaches to identify potential 
biases in genomic distribution of CPS SNPs. The CPS 
SNP-enriched genomic regions in different populations 
were then compared to test whether their preferential 
localization has overlaps across different populations. 
Analyses of signatures of selection and the distribution 
of recombination hotspots were performed in the CPS 
SNP-enriched genomic regions to determine the extent 
of involvement of these processes in generating CPS 
SNP-enriched genomic regions in different populations. 
Functional enrichment analysis of genes containing the 
CPS SNPs was performed and the enriched functional classes 
for different populations were compared to identify 
possible functional trajectories in population-specific SNP 
evolution. 

Results and discussion 

Identifying SNPs unique to each population 

One of the major achievements of the 1000 Genomes 
project has been the identification of numerous novel SNPs 
across different populations [17,18]. The sequence-based 
approach employed in the 1000 Genomes project in 
contrast to the previous genotyping-based approaches 
like HGDP and HapMap, provides an unbiased estimate 
of human genetic variation across many populations glo- 
bally [6,7,17,18]. We have used the most recent version 
(October 2012) of the 1000 Genomes data to identify 
SNPs which are observed to be unique to each of the 
individual study populations [18]. These SNPs were 
categorized into CPS SNPs and rare population-specific 
(RPS) SNPs based on a MAF cut-off of 0.05. SNPs with 
MAF >0.05 were considered as CPS SNPs while SNPs 
with lower MAFs were considered as RPS SNPs. Although 
more than 99% of population specific SNPs in the 1000 
Genomes data are RPS SNPs, we have focused our present 
study on CPS SNPs because the sample sizes (around 
90-100 individuals for each population) and low cover- 
age sequencing (around 4X for most of the genomic re- 
gions) used for generating the data make it difficult to 
reliably ascertain the population specificity of low allele 
frequency SNPs. Moreover, as these SNPs have a MAF of 
at least 0.05 they are less likely to be personal SNPs or the 
result of recent demographic events. 
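
The classification rule described above can be illustrated with a minimal Python sketch. It assumes that per-population frequencies of the variant allele are already available as a dictionary; the function name, the input layout and the use of min(f, 1 - f) as the MAF are illustrative assumptions and not the pipeline used in this study.

```python
from typing import Dict, Optional, Tuple

MAF_CUTOFF = 0.05  # cut-off used in the text to separate common from rare SNPs


def classify_population_specific(
    freqs: Dict[str, float], cutoff: float = MAF_CUTOFF
) -> Optional[Tuple[str, str]]:
    """Return (population, 'CPS' or 'RPS') when the variant allele is observed
    in exactly one population, otherwise None.

    `freqs` maps a population code to the variant allele frequency in that
    population (hypothetical input format).
    """
    carriers = [pop for pop, f in freqs.items() if f > 0.0]
    if len(carriers) != 1:
        return None  # absent everywhere, or shared between populations
    pop = carriers[0]
    f = freqs[pop]
    maf = min(f, 1.0 - f)  # assumption: MAF taken within the carrier population
    return pop, ("CPS" if maf > cutoff else "RPS")
```

For example, a variant with frequency 0.12 in LWK and 0 in every other population would be labelled a CPS SNP for LWK, whereas the same variant at frequency 0.02 would be labelled an RPS SNP.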

The present 1000 Genomes data contain two African 
(YRI (Yoruba in Ibadan, Nigeria), LWK (Luhya in Webuye, 
Kenya)), three Asian (JPT (Japanese in Tokyo, Japan), CHB 
(Han Chinese in Beijing, China) and CHS (Han Chinese 
South)), three American (MXL (Mexican Ancestry in 
Los Angeles, CA, USA), PUR (Puerto Ricans in Puerto 
Rico) and CLM (Colombians in Medellin, Colombia)), 
5 European (IBS (Iberian Populations in Spain), GBR 
(British from England and Scotland), CEU (Utah residents 
with ancestry from northern and western Europe), FIN 
(Finnish in Finland) and TSI (Toscani in Italia)) and 
one admixed African (ASW (African Ancestry in SW 
USA)) population. The frequencies of common and rare 
population-specific SNPs in these populations have 
been summarized in Figure 1A and Figure 1B, respectively. 
Although the numbers of common and rare SNPs 
differ many-fold, there are some broad similarities 
in the distribution patterns of the CPS SNPs and RPS 
SNPs. 

For example, the highest number for both CPS SNPs 
and the RPS SNPs was observed in the LWK population 
followed by the Japanese (JPT) population. Interestingly, 
in contrast to the large number of RPS SNPs observed, 
just a few CPS SNPs were found to occur in the Chinese 
populations (CHB and CHS). This observation is consist- 
ent with the fact that these populations have a similar geo- 
graphic origin, and the differentiation between them 
probably started little more than a thousand years ago 
with the southward migration of the Northern Han population 
[37-39]. In spite of the pronounced divergence of 
these populations, which is reflected in the high frequency of RPS 
SNPs and has also been observed in many previous studies, 
the relatively recent divergence has not allowed many 
of the population-specific alleles to reach frequencies of 
0.05 [37-39]. As our aim was to identify common SNPs 
which are unique in different populations, and we know 
that these populations have a common recent origin, we 
merged the two Chinese populations CHB and CHS into 
a single population (named CHINESE for this study). We 
recognise that this approach would not be suitable for a 
similar analysis with rare SNPs due to the extent of divergence 
that these populations have undergone recently. 

One of the concerns with using all the current popula- 
tions of the 1000 Genomes data for identifying population- 
specific SNPs is the inclusion of populations with known 
recent admixture, such as ASW and MXL (Supplementary 
Figures S4 and S9 from reference 18). The inclusion of 
these admixed populations may mask the true population 
specificity of SNPs. In order to identify SNPs which are 
truly unique to populations, ASW and the three American 
populations (MXL, CLM and PUR), which are known to 
have undergone a significant amount of admixture in the 
recent past, were removed from the dataset [18]. It is worth 
noting, however, that the MXL, CLM and PUR populations 
contain a few hundred common SNPs which were not ob- 
served in any other continent/population. As indicated by 
previous population structure analyses, these populations 
harbour a significant Native American genetic component; 
and the order of Native American admixture in these 
three populations is approximated by the total number of 
population-specific SNPs in these populations (highest in 
MXL followed by CLM and then PUR) [18]. It would be 
an interesting follow-up study to isolate the population- 
specific SNPs of Native American origin and to function- 
ally assess their significance in these populations. 

Figure 1 Population-specific SNPs in the 1000 Genomes data. The numbers of common (A) and rare (B) population-specific SNPs are shown for each 
of the 14 populations. As the dataset includes admixed and related populations we removed the four known 
admixed populations (ASW, CLM, PUR, and MXL) and merged the two Chinese populations CHS and CHB into a single CHINESE population. The 
common (C) and rare (D) population-specific SNPs in the remaining 9 populations were retained for further analysis. The European 
populations are shown in orange, Asian populations in purple and the African populations in light green. The American and the admixed African 
populations are shown in blue. 

The trimming and rearrangement of the population 
datasets resulted in 9 potentially independent and 
essentially non-admixed populations for further investigation 
in the current study. The distribution of the CPS 
SNPs and RPS SNPs for each population was recalculated 
considering these 9 populations only, and has been sum- 
marized in Figure 1C and D. The list of SNPs which were 
observed to be unique to each population along with their 
frequencies in the 14 study populations has been provided 
in Additional file 1. Although the removal of the admixed 
populations significantly increased the count of CPS SNPs 
for all the populations, the detected trends, for example 
the highest number of SNPs in LWK, followed by JPT, IBS 
and FIN, are similar in both sets (Figure 1A-D). 
An interesting exception is the YRI population, where the 
number of YRI-specific CPS SNPs increases several-fold with 
the removal of the admixed African American population. 
This result concurs with the known history of recent mi- 
gration and admixture of the Western African populations 
in North America [40]. However, in spite of this increase 
in the number of the CPS SNPs in the YRI, after removal 
of admixed populations, they still have only about half the 
number of CPS SNPs observed in LWK. This difference 
is, however, not surprising in view of the fact that a num- 
ber of different populations, which most probably include 
the LWK along with other Bantu-speaking populations, 
have migrated to East Africa at different time points in 
history [41-44]. The migration of several different populations, 
along with the presence of indigenous East African 
Khoesan-speaking populations in this region, which has 
been suggested to have contributed to the population 
differentiation in East Africa, might also explain the 
high frequency of CPS SNPs and RPS SNPs observed in 
the LWK [41,42]. 

The relatively high frequency of CPS SNPs as well as 
RPS SNPs in the Japanese population is notable. It is well 
known that the modern Japanese population contains admixtures 
of at least two distinct genetic components: the 
old migrants who migrated to the Japanese Archipelago 
approximately 30,000 years ago and the new migrants who 
reached Japan only about a couple of thousand years ago 
[45-47]. It would be interesting to study how far the 
unique components of both these, and perhaps other mi- 
grating populations, add up to generate the high RPS 
SNPs and CPS SNPs observed in the JPT population. 

In addition to population histories, the sample size is 
also a strong determinant of how many variants and 
unique variants are observed in a population. For example, 
the huge increase in the number of RPS SNPs 
in the Chinese populations after the merger (Figure 1D) 
is also an outcome of the increase in sample size due 
to merging of the populations. As the sample size for 
the population has doubled, the number of RPS SNPs detected 
has increased proportionately, and similar 
changes can be expected in other populations 
in the future as more samples from these populations 
are sequenced. Similarly, the lack of RPS SNPs in the IBS 
population in comparison to other populations can be 
ascribed to the inclusion of only 14 IBS samples in the 
current 1000 Genomes data set. It can be expected that 
as more samples are sequenced the fraction of RPS SNPs 
in this population will be in line with other populations. 

We found that three of the European populations (CEU, 
GBR and TSI) have only a handful of common SNPs 
unique to them in contrast to a few hundred thousand rare 
SNPs. While this makes sense in terms of demographics 
[48,49] and probable admixtures, it might also be a result of 
treating these related or partially admixed populations sep- 
arately. Approaches that group these populations together, 
based on population histories, might lead to the identifica- 
tion of some CPS SNPs in these groups too. While the high 
frequency of CPS SNPs in the Finnish population (FIN) can 
be interpreted in terms of multiple genetic components 
and demographic factors like isolation, migration and ad- 
mixture, which is reflected in their distinctive distribution 
in the European principal component analysis (PCA) plots 
in other studies [18,50,51], the high frequency of CPS SNPs 
in the Spanish (IBS) population needs to be treated with 
greater caution as the number of individuals sequenced for 
this population is only 14. Many of the SNPs which seem 
to be common (MAF > 0.05) in the IBS in the present data 
might turn out to be rare once other samples from this 
population are sequenced. 
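
The caution about the 14 IBS samples can be made concrete with a small calculation. The sketch below assumes diploid individuals and simple binomial sampling of alleles, and it ignores genotyping error and low-coverage effects; the function names and the example true frequency of 0.01 are illustrative assumptions.

```python
from math import comb


def min_copies_for_common(n_individuals: int, cutoff: float = 0.05) -> int:
    """Smallest allele count whose sample frequency exceeds the cut-off,
    assuming 2 * n_individuals sampled chromosomes."""
    chromosomes = 2 * n_individuals
    count = 1
    while count / chromosomes <= cutoff:
        count += 1
    return count


def prob_appears_common(n_individuals: int, true_freq: float,
                        cutoff: float = 0.05) -> float:
    """Binomial probability that an allele with population frequency
    `true_freq` is sampled at a frequency above the cut-off."""
    chromosomes = 2 * n_individuals
    k_min = min_copies_for_common(n_individuals, cutoff)
    return sum(
        comb(chromosomes, k) * true_freq**k * (1 - true_freq) ** (chromosomes - k)
        for k in range(k_min, chromosomes + 1)
    )
```

With 14 individuals (28 chromosomes) two copies of an allele are enough to exceed a sample frequency of 0.05, whereas with 97 individuals (194 chromosomes) ten copies are needed. Under these assumptions an allele with a true frequency of 0.01 therefore has a far greater chance of appearing common in a sample of 14 than in a sample of 97.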

Although our analysis is focused on SNPs, we studied 
the distribution of population-specific short structural 
variants (SSVs) to see whether their distribution in differ- 
ent populations concurs with that of the SNPs. Figure 2 
shows the distribution of the common population specific 
structural variants (CPS SSVs) and rare population-specific 
structural variants (RPS SSVs). Interestingly, the relative 
prevalence of the SSVs across populations shows high 
concordance with that of SNPs. However, the numbers 
observed for rare and common SSVs are similar, in contrast 
to the many-fold difference observed between the numbers of 
common and rare SNPs. 

Figure 2 Population-specific common (MAF > 0.05) and rare 
(MAF < 0.05) short structural variants (SSVs) in the 9 populations (CEU, FIN, GBR, IBS, TSI, CHINESE, JPT, LWK and YRI). 

To classify the CPS SNP variant alleles into ancestral 
and derived (based on multi-species alignment) the ances- 
tral/derived information for alleles in the 1000 Genomes 
vcf file was used [18]. As expected, more than 80% of the 
population-specific alleles were found to be the derived al- 
lele (Figure 3) indicating that most of these alleles likely 
arose in the individual populations after their divergence 
from other populations. 
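
A minimal sketch of how such a classification could be implemented is given below. It assumes that the ancestral allele annotation follows the convention in which an upper-case base denotes a high-confidence call, a lower-case base a low-confidence call, and ".", "-" or "N" a missing call; this convention and the mapping of low-confidence and missing calls to the "Not Sure" and "Undefined" categories are assumptions made for illustration.

```python
def classify_allele(variant_allele: str, ancestral_annotation: str) -> str:
    """Classify a population-specific allele against an ancestral-allele
    annotation (hypothetical helper; annotation convention is assumed)."""
    aa = ancestral_annotation.strip()
    if aa in {"", ".", "-", "N", "n"}:
        return "Undefined"   # no ancestral state information available
    if aa.islower():
        return "Not Sure"    # ancestral state called with low confidence
    # High-confidence call: the allele is ancestral only if it matches it
    return "Ancestral" if variant_allele.upper() == aa else "Derived"
```

Note that in this simplified form an allele is also reported as derived when the annotated ancestral base matches neither of the two observed alleles.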

The relative prevalence of the CPS SNPs (as well as RPS 
SNPs and SSVs) across populations, therefore, shows high 
concordance with what can be expected on the basis of 
the generally accepted model of population divergence 
and the relationships between populations. However, as 
has been demonstrated, the number of population-specific 
SNPs observed in any population, in addition to population 
histories, is also influenced by factors like sample size and 
number of related and/or admixed populations included 
in the study. The removal of the admixed African and 
American populations almost doubled the number of com- 
mon SNPs which were detected to be population-specific 
in the other 9 populations, indicating how important the 
detection and control of admixture is for identifying what is 
truly population-specific. While the lack of CPS SNPs in 
most European populations is not very surprising considering 
their population histories, as well as the number of 
populations (5 European populations in contrast to only 2 
African and 3 Asian populations) included in the dataset, it 
would be interesting to see how strongly the inclusion of 
other populations from Asia and Africa changes the number 
of population-specific SNPs as new data pour in. 



Genomic distribution of CPS SNPs 

The distribution of SNPs has long been known to be 
non-random across the genome [52-55]. Recent studies 
have further suggested that the rates of mutations in a 
genomic region in addition to the genomic context might 
also depend on the presence of repeat sequences and even 
existing SNPs in the region [56,57]. Moreover, genomic re- 
gions where genomes from different global populations 
differ very strongly from each other have also been ob- 
served [58]. Given this background it was interesting 
to investigate whether the CPS SNPs, as delineated in 
our study, also show clustered occurrences across the 
genome. To identify possible biases in the distribution 
of CPS SNPs in each population and test whether the 
enriched regions are similar in different populations 
we used a sliding window based scan. Although sliding 
window based approaches have been widely used to 
identify clusters within genomic regions [59,60], this 
approach has been shown to find some false positive 
clusters in some cases [61]. Therefore, to minimize such 
false positive results we have used two different sliding 
window based approaches and a conservative 
p-value cut-off for delineating clusters of CPS SNPs in 
each population. 

Figure 3 Classification of population-specific SNP alleles into ancestral and derived for the FIN, IBS, CHINESE, JPT, LWK and YRI populations. Alleles are classed as Ancestral, Derived, Not Sure or Undefined. The SNPs for which no ancestral state information could be 
detected are shown as "Undefined" whereas the SNPs for which the ancestral state could not be detected with confidence are shown as "Not Sure". 

50-SNP windows 

In the first approach, a window was defined as a set of 
50 contiguous SNPs and each chromosome was scanned 
along the 50-SNP windows (with a slide of 50 SNPs per 
step) separately for each population. In each step the 
fraction of CPS SNPs in each window was recorded and 
compared to an expected value, based on the occurrence 
of CPS SNPs on the corresponding chromosome for the 
particular population. The statistical significance of the 
observations was estimated using cumulative hypergeometric 
p-values calculated for each window. The results 
clearly identified specific regions of the genome to be 
enriched with CPS SNPs in each population. We detected 
665 CPS SNP-enriched windows/regions in the 6 populations 
(Table 1, Additional file 2). The populations CEU, 
TSI and GBR were not analysed due to a paucity of CPS 
SNPs. In line with the number of CPS SNPs per population, 
most CPS SNP-enriched windows were observed in the 
LWK, followed by YRI and JPT. It is interesting to note 
that, although both FIN and IBS contain a much greater 
number of CPS SNPs in comparison to the CHINESE 
population, which contains 24 enriched windows, only 
three CPS SNP-enriched windows were detected in the 
IBS population and a single such window was detected 
in the FIN population. The two highest-scoring win- 
dows detected for each population using this scan are 
shown in Table 2. In the highest-scoring windows for 
both LWK and YRI more than 50% of the SNPs were 
found to be CPS SNPs. 
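
The 50-SNP window scan can be sketched in Python as follows. This is an illustration of the general technique under stated assumptions and not the code used in this study: each chromosome is represented as a list of 0/1 flags (1 for a CPS SNP), the upper-tail hypergeometric probability is computed exactly with integer arithmetic, and the significance threshold `alpha` is a placeholder because the exact cut-off is not given in this section.

```python
from math import comb
from typing import List, Tuple


def hypergeom_upper_tail(k: int, window_size: int,
                         total_cps: int, total_snps: int) -> float:
    """P(X >= k) when `window_size` SNPs are drawn without replacement from a
    chromosome with `total_snps` SNPs, of which `total_cps` are CPS SNPs."""
    upper = min(window_size, total_cps)
    numerator = sum(
        comb(total_cps, i) * comb(total_snps - total_cps, window_size - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(total_snps, window_size)


def scan_windows(is_cps: List[int], window_size: int = 50,
                 alpha: float = 1e-6) -> List[Tuple[int, int, float]]:
    """Scan one chromosome in non-overlapping windows of `window_size` SNPs
    (a slide of one full window per step) and return (start index, CPS count,
    p-value) for windows below the hypothetical threshold `alpha`."""
    total_snps = len(is_cps)
    total_cps = sum(is_cps)
    hits = []
    for start in range(0, total_snps - window_size + 1, window_size):
        k = sum(is_cps[start:start + window_size])
        p = hypergeom_upper_tail(k, window_size, total_cps, total_snps)
        if p < alpha:
            hits.append((start, k, p))
    return hits
```

The expected number of CPS SNPs in a window is implicit in the hypergeometric model through the chromosome-wide counts, which mirrors the per-chromosome, per-population expectation described above.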

5-kb windows 

The second approach was to use a sliding window of 5 
kilobases (kb). This approach, in addition to identifying 
CPS SNP-enriched regions, provides a more direct way 
to identify possible overlap within CPS SNP-enriched 
windows across populations. Using this scan, 576 5-kb 
regions were found to be significantly enriched for CPS 
SNPs in the 6 populations (Table 1). For the four 
populations with more than a few enriched windows 
(LWK, YRI, JPT and CHINESE), most of the regions 
were identified by both sliding window 
based approaches (Table 1). The comparison of enriched 
windows identified using both the sliding window approaches 
shows that there is almost no overlap between 
the CPS SNP-enriched regions of these six populations 
(Figure 4). The second interesting aspect revealed by 
both the 50-SNP window and 5-kb window based approaches 
is that for many genomic regions the run of enrichment 
extends far beyond one or two windows. 
The regions containing the longest stretches of enriched 
50-SNP windows have been summarized in Table 3. 
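
Because fixed-length windows share genomic coordinates across populations, checking for overlap reduces to comparing window indices. The sketch below assumes non-overlapping 5-kb bins indexed by integer division of the SNP position (the step size of the scan is not specified in this section) and uses hypothetical function names and input layouts.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

WINDOW_BP = 5_000  # window length used in the second approach

Window = Tuple[str, int]  # (chromosome, window index)


def window_counts(positions: Iterable[int], window_bp: int = WINDOW_BP) -> Counter:
    """Count CPS SNPs per fixed-length window on a single chromosome."""
    return Counter(pos // window_bp for pos in positions)


def shared_windows(enriched: Dict[str, Set[Window]]) -> Dict[Window, Set[str]]:
    """Return the windows reported as enriched in more than one population,
    mapped to the set of populations in which they are enriched."""
    owners: Dict[Window, Set[str]] = {}
    for pop, windows in enriched.items():
        for window in windows:
            owners.setdefault(window, set()).add(pop)
    return {w: pops for w, pops in owners.items() if len(pops) > 1}
```

An empty result from `shared_windows` would correspond to the situation described above, in which enriched regions are essentially private to each population.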



Interestingly, the longest blocks and the highest scoring 
windows show significant overlap in some populations 
(Tables 2 and 3). For example, one of the longest blocks as 
well as one of the most CPS SNP-dense windows was detected 
near the solute carrier organic anion transporter 
family, member 1B1 (SLCO1B1) gene in the YRI population. 
Sequence variants identified in the SLCO1B1 gene 
have been associated with altered transport activity and it 
has been shown that genetic polymorphisms in the gene 
have an impact on the inter-individual variability of 
the pharmacokinetics and pharmacodynamics of specific 
drugs [62,63]. Previous studies have also observed 
unique genetic diversity in the SLCO1B1 gene between 
populations with the greatest diversity among African 
populations [62,63]. Similar overlap was also observed 
in the RAP1 interacting factor homolog (RIF1) gene in the 
CHINESE population. Additional files 2 and 3 contain the 
full list of windows identified using these approaches, and 
the SNPs included in them. Interestingly, despite fewer 
CPS SNPs and the presence of only a few enriched win- 
dows, two significantly long stretches of enrichment are 
observed in the CHINESE population. Similarly, although 
the number of enriched windows in Japanese is less than 
one third of that of the YRI, the Japanese population seems 
to harbour much longer enriched window stretches in 
comparison to the YRI population, and this enrichment 
cannot be explained solely on the basis of increased LD in 
the Japanese compared to the YRI. These observations 
taken together indicate that the bias in distribution of CPS 
SNPs is largely independent of the size of the datasets and 
the enriched windows or window blocks may represent 
genomic regions significant in terms of function or popu- 
lation histories. 

Possible origin of CPS SNP-enriched genomic regions 

Clusters of SNPs with highly differentiated allele frequen- 
cies, within and between species, have been observed in 
numerous previous studies [64-66]. The origin of such 
clusters has been ascribed to various demographic factors 
like genetic drift and gene flow as well as forces like selec- 
tion and local adaptations [67-69]. The CPS SNP clusters 



Table 1 Genomic regions enriched in common population-specific (CPS) SNPs identified using 50-SNP and 5-kb window approaches 

Population   Sample size   CPS SNPs   50-SNP window   5-kb window   Overlap 
LWK          97            34390      357             311           237 
YRI          88            18809      216             188           138 
JPT          89            10326      64              47            41 
CHINESE      197           863        24              28            21 
FIN          93            3178       1               1             0 
IBS          14            5971       3               1             0 
Total                      73537      665             576           437 

The populations CEU, TSI and GBR were excluded from this analysis due to low numbers of CPS SNPs in these populations. 

Table 2 Best common population-specific (CPS) SNP-enriched windows for each population

Population  Chr  Start      End        No. of SNPs  P-value   Gene or flanking genes
YRI         18   6266587    6271281    26           4.36E-66  L3MBTL4
YRI         12   21347746   21353031   25           9.95E-60  SLCO1B1
LWK         10   26690276   26697294   26           1.19E-59  GAD2-APBB1IP
LWK         [remaining fields illegible in source]
JPT         2    38809530   38818371   20           5.77E-46  HNRPLL
JPT         4    187416152  187426671  18           7.30E-44  LOC285441-MTNR1A
CHINESE     2    152284636  152297774  12           8.13E-36  RIF1
CHINESE     11   119411414  119420288  11           2.13E-33  LOC100499227-PVRL1
FIN         16   86084552   86093750   4            2.45E-09  IRF8-LOC146513
IBS         3    68079      77942      5            6.40E-10  na-CHL1
IBS         19   52867878   52878655   4            2.77E-08  ZNF6W

Population code, genomic coordinates, number of CPS SNPs, p-values and corresponding genes (if window is exonic or intronic) or flanking genes joined by
a "-" (if the window is intergenic), for up to two best 50-SNP windows for each population.
An intergenic window for which no flanking gene was found is indicated by "na".



Figure 4 Genomic distribution of common population-specific (CPS) SNP-enriched 5-kb windows. The windows show very little overlap
between populations and there are many blocks within populations containing contiguous windows of CPS SNP enrichment.

Table 3 Longest CPS SNP-enriched 50-SNP window stretch for each population

Population  Chromosome  Start      End        Block length  Gene or flanking genes
YRI         12          21343612   21361661   6             SLCO1B1
JPT         4           187420496  187467709  11            MTNR1A
LWK         12          79979498   80083792   12            PAWR
CHINESE     2           152268276  152401521  14            RIF1

Population code, genomic coordinates, number of 50-SNP windows in the block and the related loci are shown for each population.
No such blocks were observed for the FIN and the IBS populations.



observed in our study are somewhat similar to the clusters that show
high allele frequency differentiation within populations, as they
represent genomic regions that vary widely across populations. However,
there is an inherent difference: in these regions both the SNP
composition and the SNP density are different in a single population
compared to the others. Against this background it was important to
investigate whether the factors assumed to generate clusters of SNPs
with highly differentiated allele frequencies across populations are
also responsible for generating clusters of CPS SNPs. We used different
computational approaches to test the possible involvement of selection
or increased recombination rates in the origin of these clusters.

Role of selective sweeps 

To determine whether the genomic regions enriched in CPS SNPs are
associated with selective sweeps, we used two different approaches to
search for possible signatures of selection in these regions. The first
approach was based on the iHS (integrated Haplotype Score) statistic,
which in principle involves the detection of unusually long haplotypes
of low diversity as signatures of selection [68]. iHS scores for each
SNP in the 50-SNP windows found to be enriched with CPS SNPs were
computed using the program iHS_calc [70]. For each 50-SNP window we
calculated the proportion of SNPs with |iHS| > 2, which we call the iES
(iHS enrichment score). The background iHS and iES score distributions
were estimated from the iHS scores calculated for 10,000 random
contiguous 50-SNP windows or blocks for each population. Based on the
background distribution, we then estimated the number of 50-SNP windows
expected to fall in the top 1%, 5% and 10% of iES scores for each
population. The observed number of CPS SNP-enriched windows in the top
1%, 5% and 10% of iES scores for each population was compared with the
expected number, and the corresponding p-value for each observation was
estimated using bootstrap resampling. The results show that although
some of the CPS SNP-enriched windows show significant iHS score
enrichment, the overall distribution does not indicate any significant
association of selection with these windows (Figure 5A).
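
The iES calculation and the comparison against the background can be
sketched as follows. This is an illustrative Python outline only; it
assumes per-SNP iHS scores have already been computed (for example with
iHS_calc), and the function names, the tie handling at the cut-off and
the particular bootstrap scheme are assumptions, since the exact
resampling procedure is not spelled out here.

```python
import random
from typing import Sequence


def ies(ihs_scores: Sequence[float], threshold: float = 2.0) -> float:
    """iHS enrichment score: proportion of SNPs in a window with
    |iHS| above `threshold`."""
    if not ihs_scores:
        return 0.0
    return sum(abs(s) > threshold for s in ihs_scores) / len(ihs_scores)


def top_cutoff(background: Sequence[float], top_fraction: float) -> float:
    """Smallest background score that still lies in the top
    `top_fraction` of the background distribution."""
    ranked = sorted(background, reverse=True)
    k = max(1, int(round(top_fraction * len(ranked))))
    return ranked[k - 1]


def bootstrap_p(observed_hits: int, n_windows: int,
                background: Sequence[float], cutoff: float,
                n_boot: int = 1000, seed: int = 0) -> float:
    """One-sided bootstrap p-value (assumed scheme): the share of
    resamples of `n_windows` background windows that contain at least
    `observed_hits` windows scoring at or above `cutoff`."""
    rng = random.Random(seed)
    at_least = 0
    for _ in range(n_boot):
        sample = rng.choices(background, k=n_windows)
        if sum(s >= cutoff for s in sample) >= observed_hits:
            at_least += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least + 1) / (n_boot + 1)
```

Because iES is a proportion over 50 SNPs, many background windows share
the same score, so the number of windows at or above a cut-off can exceed
the nominal top 1%, 5% or 10%; a real analysis would need to decide how
such ties are treated.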




Figure 5 Analysis of potential signatures of selection in the
common population-specific (CPS) SNP-enriched windows.

(A) Expected and observed number of iES (iHS enrichment score)
enriched windows (see Methods for details) in YRI, LWK, JPT and
CHINESE populations. The number appended to the population code
indicates the top nth percentage of iHS scores considered
(1 = top 1%; 5 = top 5%; 10 = top 10%). The corresponding p-values
for enrichment, as -log(P), are shown on the right axis.

(B) Expected and observed occurrences of top 1%, 5% and 10%
population branch statistic (PBS) scores amongst CPS SNP-enriched
windows for YRI, LWK, JPT and CHINESE populations. A three-letter
population combination code (say YLJ) has been used to describe
the three-population set used for calculating the PBS score. The first
letter (Y) indicates the population being analysed (YRI in this case);
the CPS SNP-enriched windows are analysed for this population. The
second letter (L) indicates the population to which it was compared
(LWK here) and the third letter (J) indicates the outlier (JPT in this
case). The number appended with an underscore to each three-letter
dataset name indicates the top nth percentage PBS score cut-off used
for analysis (1 = top 1%; 5 = top 5%; 10 = top 10%).

One of the concerns about using a centimorgan (cM) based physical map,
such as the one used in this study, is that signatures of selection
might be underestimated, as the threshold of |iHS| > 2 used by Voight
and colleagues [68] might be too stringent for a cM map based analysis.
Therefore, we ran two independent sets of analyses in which the iES
scores were defined on the basis of lowered thresholds of |iHS| > 1.75
and |iHS| > 1.5, respectively. However, no distinct enrichment of iHS
scores was observed even with the lower thresholds. Results from the
analysis of the 5-kb windows were very similar to those obtained with
the 50-SNP windows. It should, however, be kept in mind that iHS in
itself might not be a very good metric for testing selective sweeps in
a dataset known to contain many CPS SNPs of moderate allele frequencies
because, unless they lie on a single haplotype, these SNPs will tend to
disrupt long haplotype blocks. The results of the iHS scan nevertheless
confirm that the CPS SNPs in CPS SNP-enriched windows show a complex
distribution that results in complex haplotype architectures rather
than a single long haplotype.

To test for selective sweeps on the basis of allele frequency
differentiation rather than haplotype length, we used the population
branch statistic (PBS) as an alternative approach for detecting
signatures of selective sweeps in CPS SNP-enriched windows [71]. PBS
has been found to be very useful in detecting high-altitude
adaptation-related SNPs in Tibetans relative to Han Chinese and Danish
populations. PBS can be thought of as an estimate of the allele
frequency change at a given locus in the history of a population since
its divergence from another population. The idea behind this analysis
is that if we consider two related populations and an outlier
population, the allele frequency changes at any locus in these two
populations should be equidistant (or have similar branch length) from
the outlier. Therefore loci which show high allele frequency
differentiation in only one of the related populations, reflected by a
high population branch length (and PBS score), may be potential
candidates for selective sweeps.
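
For reference, the branch-length idea can be written out in a few lines
of Python. In the standard formulation of PBS, each pairwise F_ST is
transformed to T = -log(1 - F_ST), and the branch leading to the focal
population is (T_focal,sister + T_focal,outlier - T_sister,outlier) / 2.
The sketch below assumes the three pairwise F_ST values for a window are
already available; how they were estimated per window is not described
in this section, and the function and argument names are hypothetical.

```python
import math


def branch_time(fst: float) -> float:
    """Transform a pairwise F_ST into the additive distance
    T = -log(1 - F_ST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("F_ST must lie in [0, 1)")
    return -math.log(1.0 - fst)


def pbs(fst_focal_sister: float,
        fst_focal_outlier: float,
        fst_sister_outlier: float) -> float:
    """Population branch statistic for the focal population.

    Example triplet from the text: focal = YRI, sister = LWK,
    outlier = JPT.
    """
    t_fs = branch_time(fst_focal_sister)
    t_fo = branch_time(fst_focal_outlier)
    t_so = branch_time(fst_sister_outlier)
    return (t_fs + t_fo - t_so) / 2.0
```

When the focal and sister populations are equally distant from the
outlier and undifferentiated from each other, the statistic is zero; it
grows as the focal population alone diverges from both of the others.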

For each population, the PBS statistic for each CPS SNP-enriched 50-SNP
window was calculated using the method used by Yi et al. [69]. For the
Asian populations (JPT and CHINESE) and European populations (IBS and
FIN), YRI was used as the outlier population. Similarly, for the
African populations YRI and LWK, the JPT population was used as the
outlier. Although the choice of outlier for the populations might be
questionable from a population history perspective, the distances
within these populations suggest that this set can still provide
reasonable estimates of branch lengths. For each 3-population set (e.g.
YRI-LWK-JPT or JPT-CHB-YRI), we estimated the background distribution
of the PBS scores using 10,000 randomly-selected 50-SNP windows. We
then identified score cut-offs based on the top 1%, 5% and 10% of the
background distribution and estimated the number of 50-SNP windows
expected to be in the top 1%, 5% and 10% PBS score range for a
population. The number of observed windows in the 1%, 5% and 10% range
was compared to the expected number and the corresponding P-values were
estimated using a bootstrap analysis. Figure 5B summarizes the PBS
score distribution for the Asian and African populations. None of the
windows found to be enriched with CPS SNPs in FIN and IBS were in the
top 1%, 5% or 10% range for the respective populations, and these
populations were therefore not retained for further analysis. It can be
seen that, although some of the populations have some enrichment of
high PBS scores in the CPS SNP-enriched windows, the lack of
statistical significance as well as the overall distribution of PBS
scores do not suggest that selection is common in these regions
(Figure 5B). Although there are quite a few other tests for detecting
selective sweeps [72,73] which could have been employed for this
dataset and might have identified a few more CPS SNP-enriched windows
as being under selective sweeps, it is unlikely that they would change
the landscape fundamentally, and it can be safely concluded that
selection is not the major factor causing CPS SNP enrichment in certain
genomic regions. However, the efficiency of existing methodologies for
detecting signatures of selection in datasets like the current 1000
Genomes dataset (which contains a large proportion of low-frequency
SNPs, sequenced on a low-coverage platform) is an important concern, as
genome-wide variation in error rates might easily mask true signals and
generate false positive signatures of selection. Development of
parameters and efficient quality control measures well suited for
identifying signatures of selection in such a dataset will
significantly contribute to future work in this direction.

Role of recombination rate 

Regions of high recombination have been shown to be related to higher
SNP densities [74,75]. As the SNP densities in the CPS SNP-enriched
windows are higher in a single population compared to others, we
considered whether there was any relationship between CPS SNP-enriched
windows and higher recombination rates. To test the association of CPS
SNP-enriched genomic regions with meiotic recombination rates, we
obtained recombination hotspots based on the recombination maps
generated by deCODE [74]. Using a SRR (sex-standardized recombination
rate) cut-off of 10, the deCODE recombination map placed only a handful
of recombination hotspots within the CPS SNP-enriched regions of all
populations taken together [76]. However, recombination hotspots have
been found to vary significantly among populations [77,78], and a
population-specific perspective of recombination was key for this
study. Therefore, in addition to the generalized deCODE recombination
map, the linkage disequilibrium (LD) based HapMap YRI map
(hapMapRelease24YRIRecombMap) was used to identify recombination
hotspots and coldspots for the YRI population [6,33,34]. Similarly, the
combined HapMap recombination map (hapMapRelease24CombinedRecombMap)
was used to identify recombination hotspots and coldspots for all other
populations [6,33,34].

We studied the genomic distribution of the recombination rates from the
YRI-specific map, and the genomic regions corresponding to the top 1%
of recombination rates were defined as recombination hotspots for YRI.
A second set of hotspots was likewise defined on the basis of the top
5% of recombination rates. Similarly, two sets of coldspots were
defined by the lowest 1% and 5% of recombination rates. Based on the
genomic distribution of recombination rates in YRI, we estimated the
number of hotspot sites expected to occur in CPS SNP-enriched windows
for the YRI population. The observed rates were compared with the
expected rates, and the statistical significance of the enrichment of
recombination hotspots was estimated at both the 1% and 5% levels. The
CPS SNP-enriched regions defined on the basis of length (5-kb) and of
50-SNP windows were analysed separately. The frequency of sites with
the top 1% and 5% recombination rates in both sets of YRI-specific CPS
SNP-enriched regions, in comparison to the respective background
distributions of genomic regions with the top 1% and 5% recombination
rates, is summarized in Figure 6A. It is clear that for both kinds of
windows and at both levels (top 1% and 5%) the recombination hotspots
were highly enriched in the population-specific SNP-enriched genomic
regions. The analysis of coldspots at both the 1% and 5% levels, on the
other hand, shows that these sites are highly under-represented in the
CPS SNP-enriched regions. A similar analysis for other populations
using the combined map (hapMapRelease24CombinedRecombMap) shows that
the trend of very significant enrichment of hotspots and significant
depletion of coldspots is consistently seen in all populations
(Figure 6B). A combined analysis of CPS SNP-enriched windows from all
the populations taken together also shows the same trend (Figure 6B).
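
The percentile-based definition of hotspots and coldspots, and the
observed-versus-expected comparison, can be sketched as below. This is
an illustrative outline under simplifying assumptions: each map site is
represented by a single recombination rate, the expected count is taken
as the stated fraction of the sites falling in enriched windows, and the
function names are hypothetical. The significance test used for
Figure 6 is not reproduced here.

```python
from typing import Dict, Sequence, Tuple


def quantile_cutoffs(rates: Sequence[float],
                     fraction: float) -> Tuple[float, float]:
    """Return (coldspot_max, hotspot_min): the rates bounding the
    lowest and the highest `fraction` of all map sites."""
    ranked = sorted(rates)
    k = max(1, int(round(fraction * len(ranked))))
    return ranked[k - 1], ranked[-k]


def observed_vs_expected(window_rates: Sequence[float],
                         all_rates: Sequence[float],
                         fraction: float) -> Dict[str, float]:
    """Count hotspot and coldspot sites inside CPS SNP-enriched windows
    and compare with the number expected if those sites were a random
    draw from the genome-wide map."""
    cold_max, hot_min = quantile_cutoffs(all_rates, fraction)
    return {
        "hot_observed": sum(r >= hot_min for r in window_rates),
        "cold_observed": sum(r <= cold_max for r in window_rates),
        "expected": fraction * len(window_rates),
    }
```

An excess of hot_observed over expected together with a deficit of
cold_observed would correspond to the pattern described for the YRI in
Figure 6A.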

Although this analysis shows a very clear trend, as the 
maps used in this study are LD based, further evidence 
in terms of experimentally derived data for at least some 
of these regions will be required to reliably establish the 
relationship between recombination hotspots and CPS 
SNP-enriched windows. Nevertheless, the observed en- 
richment of recombination hotspots in CPS SNP-enriched 
genomic regions hints that high recombination might be 




Figure 6 Recombination rates in common population-specific
(CPS) SNP-enriched regions. The expected and observed
number of hotspots (HS, defined on the basis of the top 1% and 5%
recombination rates) and coldspots (CS, defined on the basis of the
lowest 1% and 5% recombination rates) in CPS SNP-enriched regions.
(A) Recombination rates for the YRI were estimated on the basis of the
HapMap24 YRI-specific map downloaded using the UCSC table
browser. The distribution of hotspots in regions detected by
length-based (5-kb) and window-based (50-SNP) approaches using the top
1% (indicated with _1) and 5% (shown by _5) recombination rates is
shown. (B) The combined recombination map was used to identify
whether the observed pattern of distribution of hotspots and
coldspots in YRI also holds for JPT, LWK and CHINESE population-specific
windows (based on the top 5% recombination rates). In addition to
individual populations, the CPS SNP-enriched windows for all four
populations taken together (ALL_HS and ALL_CS) are also shown.



one of the factors contributing to the generation of CPS SNP clusters.
The presence of recombination hotspot(s) in a short genomic region
(5 kb or 50 SNPs), especially in the case of a genotype-based
recombination map like the one used here, clearly indicates that the LD
architecture is complex and that the LD blocks are short within that
particular region. Moreover, as the width of a recombination hotspot
(1-2 kb) is significant with respect to the length of the sliding
windows (5 kb or 50 SNPs) used in the analysis, the presence of even a
single hotspot can considerably lower the LD of the region covered by
the window. The enrichment of recombination hotspots therefore suggests
that LD blocks are probably shorter and that LD is probably lower in
the CPS SNP-enriched regions compared to average genomic regions.
Moreover, in addition to recombination rate associated SNP density
variations, the high recombination rates also suggest that the effects
of population admixture will be more prominent in these regions, which
might also be an important source of the observed CPS SNP clusters.
Furthermore, recombination hotspots have been found to vary
significantly among populations [77,78]. Therefore, if recombination
hotspots play a role in generating CPS SNP clusters, the occurrence of
these regions at different genomic positions in different populations
becomes explainable.

Functional categories and pathway distribution of CPS SNPs 

To study the functional relevance of the CPS SNPs we analysed their
localization with respect to known genes. As seen for most novel
variants identified by the 1000 Genomes Project [17], and as can be
expected on the basis of the background distribution of SNPs, most of
the CPS SNPs were found to be either intergenic or intronic (Figure 7).
Despite certain minor variations, for example in FIN and JPT, the
overall distribution of the CPS SNPs across the major genomic regions
was similar in all the populations. Interestingly, however, the number
of coding non-synonymous CPS SNPs in these populations (Table 4) was
found to be independent of the total number of CPS SNPs in them. These
coding non-synonymous CPS SNPs




Figure 7 Localization of common population-specific (CPS) SNPs
in genomic regions defined on the basis of gene architecture. The
majority of the CPS SNPs were found to be intergenic and intronic.
The categories shown are Intergenic, Intronic, ncRNA, Exonic and Others.
The category ncRNA includes various types of non-coding RNAs and
the category "Others" includes upstream, downstream and UTR SNPs.
The expected distribution based on the overall occurrence of SNPs in
the human genome is shown as "Background".



were found to occur in roughly equal numbers in YRI, LWK and JPT; only
a single one was detected in the IBS, and none were found in the FIN
and CHINESE populations. The functional impact of these non-synonymous
coding CPS SNPs was assessed using a combination of four different SNP
function prediction tools (SIFT, PolyPhen-2, LRT and MutationTaster),
which predicted most of these SNPs to have a potential functional
impact [79-82]. The list of coding non-synonymous SNPs along with their
predicted functional significance is summarized in Table 4.

Eleven coding non-synonymous CPS SNPs, mapping to 10 different genes,
were observed in the YRI; 8 of them were predicted to be functional by
at least one of the tools. Four of the 10 CPS SNP-containing genes have
a known association with a disease [79,80], including HLCS
(holocarboxylase synthetase deficiency), TGM1 (congenital ichthyosis),
DIAPH1 (deafness) and PAWR (which induces apoptosis in certain cancer
cells). Moreover, a functional SNP was detected in TRIM5, which is a
capsid-specific restriction factor involved in blocking viral
replication early in the life cycle. Additionally, two coding
non-synonymous SNPs were detected in the UPK3B gene, which plays an
important role in AUM-cytoskeleton interaction in terminally
differentiated urothelial cells.

In the LWK population 12 coding non-synonymous CPS SNPs in 11 genes
were observed, 5 of which are linked to disease phenotypes. These
include ABCA4, linked to Stargardt disease 1, hereditary macular
degeneration and retinitis pigmentosa; ATP8B1, associated with various
forms of cholestasis; GHR, which is linked to Laron syndrome, resulting
in growth impairment; MCCC1, involved in methylcrotonoyl-CoA
carboxylase 1 deficiency; and NLRP12 (two SNPs), which is associated
with familial cold autoinflammatory syndrome. In the JPT population 8
non-synonymous CPS SNPs, all of which were predicted to be functional,
were observed in 8 genes. Some of these genes were found to be involved
in melatonin activity, melanogenesis, olfaction and hair formation.
Only a single non-synonymous CPS SNP was detected in the IBS
population, in the MRPL35 gene, whereas none was found in the CHINESE
and the FIN populations.

Additionally, a total of 520 CPS SNPs with probable consequences for
gene regulation were identified (Additional file 4) [83]. All are from
RegulomeDB category 2, which denotes direct evidence of binding through
ChIP-seq and DNase data together with either a matched position weight
matrix for the ChIP-seq factor or a DNase footprint. Of the putative
regulatory variants identified, the majority are intergenic (234) and
intronic (224). Approximately 3 times as many upstream (24) as
downstream (7) variants were identified, while 3'-UTR variants were
approximately double the number in the 5'-UTR. The occurrence of these
potential regulatory SNPs, in

Table 4 Coding non-synonymous common population-specific SNPs and potential functional impact

Pop  SNP          Gene      SIFT  PolyPhen-2  LRT  MutationTaster
IBS  rs34804805   MRPL35    T     B           N    N
JPT  rs3749130    ARHGAP25  D     P           N    N
     rs2296151    ASIP      T     P           N    N
     rs17846992   CCKAR     D     D           N    D
     rs77945315   CSNK1E    D     B           D    D
     rs76875855   KRT73     D     D           D    N
     rs1800885    MTNR1A    T     P           U    D
     rs41428447   NDUFS2    D     B           D    D
     rs74548274   OR5D13    T     P           U    N
LWK  rs61749435   ABCA4     D     B           D    N
     rs34018205   ATP8B1    T     B           D    D
     rs34744783   C20orf26  D     B           N    N
     rs34347250   EGLN3     T     B           D    D
     rs6413484    GHR       D     B           N    N
     rs34752664   KCNF1     T     B           D    N
     rs35706839   MCCC1     T     NA          D    D
     rs76085152   NLRP12    D     NA          N    N
     rs104895564  NLRP12    T     D           N    N
     rs35651739   NOXO1     D     D           N    N
     rs3087400    REV1      T     B           N    N
     rs34994431   SLC16A11  T     D           N    D
YRI  rs35755269   DIAPH1    NA    NA          N    P
     rs34901743   HDAC3     T     D           D    P
     rs1065759    HLCS      T     P           N    D
     rs6299       HTR1D     D     P           N    P
     rs8176804    PAWR      T     B           N    N
     rs34781001   RPN1      T     P           D    D
     rs2229464    TGM1      T     B           N    D
     rs59896509   TRIM5     D     D           D    D
     rs1799126    UPK3B     D     NA          NA   N
     rs1799125    UPK3B     T     NA          U    N
     rs34995077   ZNF565    T     P           N    N

Functions were assessed using a set of four different tools [79-82]. The predictions D and T for SIFT mean Deleterious and Tolerable respectively. For PolyPhen-2,
B = Benign; P = Possibly Damaging; D = Probably Damaging; and NA refers to SNPs for which no information was found. Similarly, for LRT, D = Deleterious
Non-synonymous SNP; N = Neutral; U = Unknown; and for MutationTaster, N = Polymorphism; D = Disease Causing; P = Polymorphism automatic.



addition to the potentially functional coding non-synonymous CPS SNPs,
indicates that in spite of occurring in a single population, at least
some of the CPS SNPs might play a significant functional role in some
of these populations.

To identify possible functional preference in the distribution of CPS
SNPs in different populations we used the Ingenuity Pathway Analysis
tool (IPA) [84] and DAVID [85] to identify functional classes,
metabolic pathways and regulatory networks enriched in CPS SNPs in each
population. The populations CEU, GBR and TSI were excluded from this
analysis as they contain too few CPS SNPs to generate statistically and
biologically meaningful results. The top 5 canonical pathways found to
be overrepresented in the CPS SNPs for each population using IPA are
shown in Figure 8. We also prepared an extended gene list for each
population which, in addition to genes for coding and intronic SNPs,
included nearby genes for the intergenic SNPs. This set was created to
provide a more inclusive view of the functional preference, as
intergenic SNPs, which form a large proportion of CPS SNPs, are
otherwise completely excluded from the pathway

Population  Pathway name                                       p-value   N_CPS  N_TOT
YRI         Axonal Guidance Signaling                          1.46E-08  85     471
            Cardiac β-adrenergic Signaling                     4.39E-06  34     158
            Netrin Signaling                                   2.56E-05  15     57
            Protein Kinase A Signaling                         2.77E-05  68     401
            Synaptic Long Term Depression                      2.91E-05  33     160
LWK         Axonal Guidance Signaling                          5.06E-10  119    471
            CREB Signaling in Neurons                          2.65E-08  59     206
            IL-8 Signaling                                     3.49E-07  58     208
            Neuropathic Pain Signaling In Dorsal Horn Neurons

Figure 8 Ingenuity canonical pathways enriched with common population-specific (CPS) SNPs. The 5 most overrepresented pathways for
each population identified using IPA are shown. N_CPS denotes the number of CPS SNP-containing genes in the pathway and N_TOT denotes the total
number of genes in the pathway. Each pathway which was found to occur in two or more populations is shown in bold and a distinct colour.



analysis. The top 5 CPS SNP-enriched canonical pathways for each
population derived using the extended gene set are tabulated in
Additional file 5. As expected, the pathways found using both
approaches show a significant overlap. Interestingly, there was also a
very significant overlap between different populations in the pathways
detected to be enriched in CPS SNPs. We also performed an analysis for
enrichment of regulatory networks in the CPS SNPs and their
corresponding genes. Regulatory networks overrepresented in (a) CPS
SNP-containing genes and (b) the extended gene list (all genes
containing variants, as well as nearest neighbour genes for intergenic
variants) are summarized for each population in Additional file 6;
these networks also exhibited significant overlap between different
populations.

Using DAVID, we identified a number of CPS SNP-enriched disease,
pathway and gene ontology (GO) classes for each population. As observed
for the pathways detected using IPA, the CPS SNP-enriched disease,
pathway and GO classes identified using DAVID overlapped between the
different populations (Additional file 7). Moreover, the pathways
identified using DAVID in many cases supported the pathways identified
using the IPA tool. One of the major functional classes/pathways
observed to show significant CPS SNP enrichment in most of the
populations and in multiple analyses was the axon guidance signalling
or axonogenesis pathway. This observation also supports previous work
in which genetic variations in genes involved in axon guidance
signalling were found to show significantly high levels of population
differentiation [86,87]. Moreover, a recent study aimed at identifying
loci under parallel divergence (loci that have undergone moderate
allele frequency changes in multiple independent human lineages) found
most parallel divergent genes to occur in this pathway [88]. This may
explain our observation of CPS SNP enrichment in the corresponding
genomic regions in multiple populations. It is also interesting to note
that several recent studies have shown this pathway to be one of the
major mutational targets in pancreatic and other cancers [89-91]. It
would be an interesting follow-up study to probe whether evolutionary
forces, like mutation rate, might contribute to the observed SNP
accumulation in regions where genes for these pathways occur and
whether this enrichment has any adaptive relevance. Similar overlap was
observed in many other CPS SNP-enriched pathways, including protein
kinase A signalling and CREB signalling in neurons (Figure 8), which
points to underlying functional similarities in the distribution of CPS
SNPs in different populations.

Current functional and pathway analysis is clearly limited by the state
of current knowledge about gene interactions and functions. Well
studied genes and pathways tend to have more complete, validated
interaction and functional data than less studied genes and pathways.
As the information on functional gene networks and regulatory pathways
increases, we can anticipate that additional gene functions and
networks may be identified as being differentially regulated between
populations; these results can therefore only represent our findings
with respect to the current state of knowledge.

Conclusions 

In this study we have highlighted some interesting observations with
regard to population-specific genetic variation, using an unbiased data
set generated by whole genome sequencing. Firstly, we showed that CPS
SNPs are abundant but are not randomly distributed and can cluster into
regions that span up to several kilobases. Secondly, we have shown that
at least some of the CPS SNPs are likely to have a phenotypic or
functional impact. Thirdly, in terms of mechanism, we were unable to
detect any evidence for selection in the regions of high CPS SNP
density but, interestingly, these regions more often associate with
regions of high recombination. The enrichment of recombination hotspots
also indicates that the LD in the CPS SNP-enriched regions is lower
than the genome average, and rules out any possible role of LD in
generating CPS SNP-enriched regions. Finally, functional enrichment
analysis of the CPS SNPs and their associated genes has highlighted
some interesting pathways and functions over-represented in several
populations. In particular, it highlighted possible hypermutability of
genes involved in axonal guidance signalling, perhaps suggesting some
evolutionary plasticity in this pathway.

Avenues for future exploration have been highlighted. 
However, there are several caveats. Firstly, the number 
of individuals per population for whom we have full 
genome sequences is presently low (N < 100). Secondly, 
the definition of a population in terms of origin and ad- 
mixture is at times vague and increased mobility world- 
wide leads to elevated levels of admixture. Moreover, 
the number of variants analysed is only a small subset
(<1%) of all population-specific variants since rare variants 
(MAF < 0.05) have not been included. Genome sequen- 
cing of global populations is providing data which will as- 
sist in teasing out ancestral populations and will shed 
further light on population differentiation and adaptation. 
The availability of more extensive data along with an
increased depth of sequencing, which permits the reliable
study of rare genetic variants and structural variants, is 
therefore required for a better understanding of the 
relationship between unique genotypic variations and 
their geographical contexts. 

Methods 

Data retrieval and processing 

The recent version (Phase 1, version 3, October 2012) of
the 1000 Genomes VCF files, containing phased genotypes
for 36.7 million autosomal SNPs and 1.38 million autosomal
SSVs, was downloaded from the 1000 Genomes Project
FTP server [92]. Ancestral allele information for all variants,
based on multi-species alignments,
was also downloaded from the 1000 Genomes FTP site.
The conversion of the 1000 Genomes data to PLINK
format was performed using VCFtools [93,94]. Frequency
calculations and many other data manipulation
operations were performed using PLINK [94]. The admixed 
populations (ASW, CLM, MXL and PUR) were excluded 
and the Chinese populations (CHB and CHS) were merged 
into a single population using PLINK which we refer to as 
"CHINESE". The SNPs were classified as common in a 
population if the MAF was observed to be greater than 
0.05 in that population. SNPs with lower MAF were 
treated as rare. 
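
The common/rare rule above can be illustrated with a minimal Python sketch. This is not the PLINK workflow used in the study; the function names and the allele-count inputs are hypothetical and only show how the MAF > 0.05 threshold is applied.

```python
def minor_allele_frequency(alt_count: int, total_alleles: int) -> float:
    """Return the minor allele frequency from an alternate-allele count.

    total_alleles is the number of chromosomes genotyped
    (twice the number of diploid individuals).
    """
    if total_alleles <= 0 or not 0 <= alt_count <= total_alleles:
        raise ValueError("allele counts are inconsistent")
    # The minor allele is whichever of the two alleles is less frequent.
    minor_count = min(alt_count, total_alleles - alt_count)
    return minor_count / total_alleles


def classify_snp(maf: float, threshold: float = 0.05) -> str:
    """Label a SNP as 'common' if MAF > threshold, otherwise 'rare'."""
    return "common" if maf > threshold else "rare"


if __name__ == "__main__":
    # Hypothetical population of 100 individuals (200 chromosomes).
    print(classify_snp(minor_allele_frequency(12, 200)))  # MAF = 0.06
    print(classify_snp(minor_allele_frequency(10, 200)))  # MAF = 0.05
```

Because the threshold is strict, a SNP with a MAF of exactly 0.05 is treated as rare in this sketch.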

Genomic distribution and regional enrichment analysis 

Identification of enrichment of CPS SNPs in genomic 
regions was performed using custom Perl scripts. We 
used two sliding window based approaches. In the first 
approach, each chromosome was scanned using sliding 
and non-overlapping 50-SNP windows and the frequency 
of CPS SNPs in each window was computed. Based on the 
overall occurrence of CPS SNPs in the entire chromosome 
the cumulative hypergeometric p-value for enrichment of
CPS SNPs in each window was estimated. To correct for
multiple hypothesis testing we used a conservative p-value
cut-off of < 5 × 10^-8 for the identification of windows
enriched with CPS SNPs. In the second approach we 
employed a similar scan using 5-kb non-overlapping 
windows. 
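
The 50-SNP window scan can be sketched in Python as follows. The original analysis used custom Perl scripts that are not reproduced here, so this is only an illustration under stated assumptions: the function names are hypothetical, windows are taken as consecutive and non-overlapping, a trailing partial window is dropped, and the input is a per-chromosome list of flags marking which SNPs are CPS SNPs.

```python
from math import comb


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) for a window of n SNPs drawn from a chromosome of
    N SNPs, K of which are CPS SNPs (cumulative hypergeometric)."""
    upper = min(n, K)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    numerator = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, upper + 1))
    return numerator / comb(N, n)


def enriched_windows(is_cps, window_size=50, alpha=5e-8):
    """Scan consecutive non-overlapping windows and return
    (start_index, cps_count, p_value) for windows with p < alpha."""
    N = len(is_cps)
    K = sum(is_cps)
    hits = []
    for start in range(0, N - window_size + 1, window_size):
        k = sum(is_cps[start:start + window_size])
        p = hypergeom_upper_tail(k, window_size, K, N)
        if p < alpha:
            hits.append((start, k, p))
    return hits


if __name__ == "__main__":
    # Toy chromosome: 1,000 SNPs, with all CPS SNPs in the first window.
    flags = [True] * 50 + [False] * 950
    for start, count, p in enriched_windows(flags):
        print(start, count, p)
```

Exact integer arithmetic is used for the binomial coefficients, which avoids floating-point underflow until the final division.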

Selection scan 

Signatures of selection were evaluated using two different 
approaches. The haplotype homozygosity based iHS score 
was calculated using the WHAMM package [95]. As cal- 
culation of iHS requires physical positions to be specified, 
we downloaded the combined linkage physical map for 
human genome build GRCh37 from Rutgers Map [96] and
incorporated the physical positions into the existing data. 
For each population, iHS scores for SNPs occurring in 
the 50-SNP windows which were found to be enriched 
in CPS SNPs in that population were calculated using
the iHS_calc script from the WHAMM package. To estimate
the background iHS distribution for each population,
we randomly sampled 10,000 50-SNP blocks and calcu- 
lated the iHS scores for the SNPs occurring in these 
blocks. Based on allele frequency bins derived from the 
background, the iHS scores were then standardized. As an
extension of the iHS scores we also defined an iHS enrichment
score (iES), which is the proportion of SNPs
in each 50-SNP window with |iHS| > 2. Windows
showing the top 1%, 5% and 10% iES scores were respect- 
ively selected as three levels for the analysis. For each level 
the expected iES distribution in all CPS SNP windows 
of a population was estimated and compared to the actual 
distribution. Statistical significance of overrepresentation 
of iES scores in CPS SNP-enriched windows of a population
was estimated using a p-value calculated by a
bootstrap resampling analysis. A similar analysis was 
also performed for CPS SNP-enriched 5-kb windows in 
each population. In addition, a separate set of analyses
was performed for both 50-SNP and 5-kb windows,
considering only SNPs with a minimum MAF of 0.05. 
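
The iES definition and the resampling test can be sketched as below. The text does not give the details of the bootstrap procedure, so the resampling scheme here is an assumption: windows are drawn with replacement from the background, and the p-value is the fraction of resamples containing at least as many high-iES windows as observed. All function names are hypothetical.

```python
import random


def ies(ihs_scores, cutoff=2.0):
    """iHS enrichment score: proportion of SNPs in a window
    whose standardized |iHS| exceeds the cutoff."""
    if not ihs_scores:
        raise ValueError("window contains no iHS scores")
    extreme = sum(1 for score in ihs_scores if abs(score) > cutoff)
    return extreme / len(ihs_scores)


def bootstrap_p_value(observed_count, background_flags, n_windows,
                      n_resamples=1000, seed=0):
    """Assumed bootstrap scheme (not taken from the source).

    background_flags: one boolean per background window, True if its
    iES falls in the chosen top fraction (for example the top 5%).
    n_windows: number of CPS SNP-enriched windows being tested.
    """
    rng = random.Random(seed)
    at_least = 0
    for _ in range(n_resamples):
        # Draw n_windows background windows with replacement.
        count = sum(rng.choice(background_flags) for _ in range(n_windows))
        if count >= observed_count:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (n_resamples + 1)


if __name__ == "__main__":
    print(ies([2.5, -3.0, 0.1, 1.9]))
    background = [True] * 5 + [False] * 95
    print(bootstrap_p_value(12, background, n_windows=100))
```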

The calculation of PBS was carried out following the 
methods proposed by Yi and colleagues [71]. For calculat- 
ing PBS scores for the African populations (YRI and LWK), 
JPT was used as the outgroup. For the Asian populations
CHINESE and JPT, YRI was used as the outgroup. Similarly,
for the European populations (FIN and IBS), YRI was used
as the outgroup. For each three-population set (like YRI-
LWK-JPT or JPT-CHB-YRI) we estimated the background 
distribution of the PBS scores using 10,000 randomly selected
50-SNP windows. We then identified score cut-offs
based on the top 1%, 5% and 10% of the background distribution
and estimated the number of 50-SNP and 5-kb
windows expected to fall in the top 1%, 5%
and 10% PBS score ranges for a population. The number
of observed windows in the 1%, 5% and 10% ranges was
compared to the expected number and the corresponding
p-values were estimated using a bootstrap analysis.
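
For reference, the PBS of Yi and colleagues combines three pairwise FST values through the log-transformed branch length T = -log(1 - FST). The Python sketch below shows that arithmetic only; the function and parameter names are hypothetical, and how the pairwise FST values were computed per window is not specified here.

```python
import math


def branch_length(fst: float) -> float:
    """Log-transformed divergence T = -log(1 - FST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("FST must lie in [0, 1)")
    return -math.log(1.0 - fst)


def pbs(fst_target_sister: float,
        fst_target_distant: float,
        fst_sister_distant: float) -> float:
    """Population branch statistic for the target population.

    The three arguments are pairwise FST values between the target,
    its closely related population, and the third, more distant
    reference population (for example YRI, LWK and JPT).
    """
    t_ts = branch_length(fst_target_sister)
    t_td = branch_length(fst_target_distant)
    t_sd = branch_length(fst_sister_distant)
    return (t_ts + t_td - t_sd) / 2.0


if __name__ == "__main__":
    # Hypothetical FST values for a single window.
    print(pbs(0.05, 0.20, 0.18))
```

A large PBS indicates that allele frequencies in the window have changed mainly on the branch leading to the target population.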

Recombination rate 

We retrieved the deCODE recombination map and the 
HapMap-related recombination maps (hapMapRelease24YRIRecombMap
and hapMapRelease24CombinedRecombMap)
using the UCSC Table Browser [97]. Applying
an SRR (sex-standardized recombination rate) cut-off of 10
to the deCODE recombination map identified only a few
hotspots in the gene set, so these were not analysed further.

The HapMap YRI recombination map (hapMapRelease24YRIRecombMap)
was used to identify recombination
hotspots and coldspots in YRI and the combined dataset. 
The distribution of recombination rates was studied to 
select genomic regions showing the top 1% recombination
rate scores and these regions were designated as
recombination hotspots. We also used the top 5% recombination
rate scores to select a second set of hotspots.
Similarly, two sets of coldspots were defined
by the lowest 1% and 5% recombination rates. Based 
on the genomic distribution of recombination rates in 
YRI (hapMapRelease24YRIRecombMap) we estimated 
the number of hotspot sites expected to occur in CPS 
SNP-enriched windows for YRI. The expected value 
was compared to the observed value and a cumulative
hypergeometric p-value was used to estimate the statistical
significance of the over- and under-representation
of recombination hotspots and coldspots in the CPS
SNP-enriched 50-SNP windows and the CPS SNP-enriched
5-kb windows in YRI. Similar analyses were conducted for
all other populations, individually as well as combined to- 
gether, using the HapMap combined recombination map 
(hapMapRelease24CombinedRecombMap). 
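
The percentile-based definition of hotspots and coldspots can be sketched in Python as below. This is an illustration only: the function names are hypothetical, the input is assumed to be one recombination rate per genomic region, and tied rates at a cut-off are all assigned to the same class, which the source does not address.

```python
def percentile_cutoffs(rates, fraction):
    """Return (low_cutoff, high_cutoff) so that roughly `fraction` of
    regions lie at or below low_cutoff and at or above high_cutoff."""
    ordered = sorted(rates)
    n = len(ordered)
    if n == 0:
        raise ValueError("no recombination rates supplied")
    k = max(1, int(n * fraction))  # number of regions in each tail
    return ordered[k - 1], ordered[n - k]


def label_regions(rates, fraction=0.01):
    """Label each region 'hotspot', 'coldspot' or 'neutral'."""
    low, high = percentile_cutoffs(rates, fraction)
    labels = []
    for rate in rates:
        if rate >= high:
            labels.append("hotspot")
        elif rate <= low:
            labels.append("coldspot")
        else:
            labels.append("neutral")
    return labels


if __name__ == "__main__":
    # Toy map: 100 regions with rates 1.0 to 100.0 (arbitrary units).
    toy_rates = [float(i) for i in range(1, 101)]
    strict = label_regions(toy_rates, fraction=0.01)
    relaxed = label_regions(toy_rates, fraction=0.05)
    print(strict.count("hotspot"), relaxed.count("hotspot"))
```

The observed number of hotspot sites inside CPS SNP-enriched windows can then be compared with the expected number using a cumulative hypergeometric test, as described in the text.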

SNP function assessment 

The genomic contexts of all CPS SNPs were determined 
using ANNOVAR [98], which was also used to annotate 
potentially functional non-synonymous variants based 
on their predicted functional impact at the protein level. 
ANNOVAR derives pre-computed functional impact 
scores for SIFT [80], PolyPhen-2 [79], LRT [82] and
MutationTaster [81]. Non-synonymous variants were considered
to have a functional impact if the recommended
score criterion for any one of the algorithms was met:
SIFT > 0.95, PolyPhen-2 > 0.85, LRT > 0.5 and
MutationTaster > 0.50.
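
The "any one algorithm" rule above can be expressed as a short Python sketch. The dictionary keys and function name are hypothetical and do not correspond to ANNOVAR column names; the thresholds are those stated in the text, and a missing score is assumed to count as not meeting the criterion.

```python
# Score cut-offs as stated in the text; keys are illustrative labels.
THRESHOLDS = {
    "SIFT": 0.95,
    "PolyPhen2": 0.85,
    "LRT": 0.5,
    "MutationTaster": 0.50,
}


def has_functional_impact(scores: dict) -> bool:
    """Return True if the score of at least one algorithm exceeds
    its cut-off. Missing or None scores are ignored (assumption)."""
    for name, cutoff in THRESHOLDS.items():
        value = scores.get(name)
        if value is not None and value > cutoff:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical pre-computed scores for two non-synonymous variants.
    print(has_functional_impact({"SIFT": 0.99, "LRT": 0.1}))
    print(has_functional_impact({"SIFT": 0.50, "LRT": 0.4}))
```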

In order to identify non-coding CPS SNPs that may 
have an effect on the binding of regulatory factors, in- 
tronic variants and those flanking genes were searched 
against the RegulomeDB database [83], which employs a 
heuristic scoring system based on the confidence that 
the variant lies in a regulatory element and whether it 
has known or possible functional consequences such as 
alteration of transcription factor (TF) binding and changes
in expression patterns of the associated gene(s). dbSNP 
[99] variants are classified into 6 categories, with category 
1 having highest confidence due to associated eQTL data, 
and category 6 the lowest. Only CPS SNPs belonging 
to categories 1 and 2 were considered to be regulation- 
modifying, since they are the most likely to result in a 
functional consequence. 

IPA analysis 

For each population, two gene lists were generated from 
the CPS SNP set. The first contained only genes that in- 
cluded selected variants, identified by rs IDs [99]. The 
second contained all genes that contained the identified 
SNPs, as well as nearest neighbour genes for the SNPs that 
were intergenic. By definition, the second list contained 
more genes than the first. Ingenuity Pathway Analysis (IPA)
software was used to analyse gene interaction networks
in the gene lists, as well as enriched 'canonical' pathways 
describing well characterised and validated regulatory 
pathways [84]. 

DAVID analysis 

The Database for Annotation, Visualisation and Integrated 
Discovery (DAVID) [85] is an online tool that accepts a 
list of genes as input and performs functional analysis on 
them. It provides a list of functions enriched in the gene 
list, and clusters these functions according to their similar- 
ity. Functions include gene ontology (GO) and Swiss-Prot 
annotation, InterPro matches, OMIM [100] and other dis- 
ease links, as well as KEGG [101,102] and other pathway 
database links. The gene-enrichment analysis is based on 
Fisher's exact test, which determines whether or not a
given list of genes is enriched for a certain function label
or if this function occurs in the list by chance. A p-value
indicates the significance, and adjusted p-values are also
provided after correction for multiple testing. The gene lists
for each population that contained CPS SNPs were run 
through DAVID to identify overrepresented pathways and 
other functional labels. 

Function and disease association of CPS-SNP 
containing genes 

Potential functions of CPS-SNP containing genes and their 
role in various diseases were inferred from the GeneCards 
database [103].